Despite rapid advances in face recognition, there remains a clear gap between the performance of still image-based face recognition and video-based face recognition, due to the vast difference in visual quality between the domains and the difficulty of curating diverse large-scale video datasets. This paper addresses both of those challenges through an image-to-video feature-level domain adaptation approach to learn discriminative video frame representations. The framework utilizes large-scale unlabeled video data to reduce the gap between the domains while transferring discriminative knowledge from large-scale labeled still images. Given a face recognition network pretrained in the image domain, the adaptation is achieved by (i) distilling knowledge from the network to a video adaptation network through feature matching, (ii) performing feature restoration through synthetic data augmentation, and (iii) learning a domain-invariant feature through a domain adversarial discriminator. We further improve performance through a discriminator-guided feature fusion that boosts high-quality frames while eliminating those degraded by video domain-specific factors. Experiments on the YouTube Faces and IJB-A datasets demonstrate that each module contributes to our feature-level domain adaptation framework and substantially improves video face recognition performance, achieving state-of-the-art accuracy. We demonstrate qualitatively that the network learns to suppress diverse artifacts in videos, such as pose variation, illumination change, or occlusion, without being explicitly trained for them.
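The discriminator-guided feature fusion described above can be sketched as a quality-weighted average of per-frame features, where a softmax over discriminator quality scores up-weights high-quality frames and suppresses degraded ones. This is a minimal illustrative sketch, not the paper's exact formulation: the function names, the softmax weighting, and the scalar quality logits are all assumptions.

```python
import math

def fuse_frame_features(features, quality_logits):
    """Fuse per-frame feature vectors into a single video descriptor.

    features: list of equal-length frame feature vectors.
    quality_logits: one hypothetical discriminator quality score per frame;
    higher means the frame looks more like a clean still image.
    """
    # Numerically stable softmax over the quality scores.
    m = max(quality_logits)
    exps = [math.exp(q - m) for q in quality_logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of the frame features, dimension by dimension.
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)]
```

With equal quality scores this reduces to plain average pooling; as one frame's score dominates, the fused descriptor approaches that frame's feature, which is how degraded frames are effectively eliminated from the video representation.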